fix: [v3] FrameworkProcessor and ModelTrainer: 4 regressions (including dropping Code (5765)#5769
Conversation
…ng dropping Code (5765)
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes four V2→V3 regressions but introduces several serious architectural and code quality issues. The monkey-patching of ProcessingJob.wait is a fragile anti-pattern that will cause maintenance headaches. The session-propagation approach via private attributes (_sagemaker_session) is brittle. There are also security concerns with ARN parsing, missing type annotations, bare except clauses, and a new dependency on graphene that seems unjustified.
| if status in ('Completed', 'Failed', 'Stopped'): | ||
| break | ||
| time.sleep(poll) | ||
|
|
There was a problem hiding this comment.
Critical: Monkey-patching ProcessingJob.wait is a fragile anti-pattern. This globally mutates a sagemaker-core resource class at import time, which:
- Affects all consumers of
ProcessingJob, not just those using this code path - Creates hidden coupling between modules
- Is fragile if
ProcessingJobis imported before/after this module - Violates the V3 tenet that subpackages should use sagemaker-core properly
Instead, consider:
- Overriding the wait behavior in the
Processorclass itself (e.g.,processor.latest_job_wait()that uses the stored session) - Or wrapping the wait call at the call site in
_start_new/ whereverwait()is invoked - Or contributing a fix to sagemaker-core to accept a session/client parameter in
wait()/refresh()
| # Update training_job attributes from the describe response | ||
| from graphene.utils.str_converters import to_snake_case | ||
| for key, value in response.items(): | ||
| snake_key = to_snake_case(key) |
There was a problem hiding this comment.
New dependency on graphene is unjustified and violates dependency guidelines. from graphene.utils.str_converters import to_snake_case introduces a dependency on the GraphQL library graphene just for a string conversion utility. This is a heavy dependency for a trivial function. Use a simple regex or write a 3-line helper:
import re
def _to_snake_case(name: str) -> str:
s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()| pass | ||
| except Exception: | ||
| # Fallback to default refresh | ||
| training_job.refresh() |
There was a problem hiding this comment.
Bare except Exception: that silently falls back is dangerous. If the session-based describe call fails for a real reason (e.g., permissions, throttling), this silently falls back to training_job.refresh() which may use the wrong credentials — the exact bug this PR is trying to fix. At minimum, log the exception. Better yet, only catch specific expected exceptions or remove the fallback entirely.
|
|
||
|
|
||
|
|
||
| def _refresh_training_job(training_job: TrainingJob, sagemaker_session=None): |
There was a problem hiding this comment.
Missing type annotation for sagemaker_session parameter. Per SDK conventions, all new public functions must have type annotations. Should be:
def _refresh_training_job(training_job: TrainingJob, sagemaker_session: Optional[Session] = None):| # The key assertion: _sagemaker_session should be set | ||
| if result is not None and hasattr(result, '_sagemaker_session'): | ||
| assert result._sagemaker_session == mock_session | ||
|
|
There was a problem hiding this comment.
Weak test assertion: conditional check defeats the purpose. if result is not None and hasattr(result, '_sagemaker_session'): means this test passes even if the feature is completely broken (result is None or attribute doesn't exist). This should be an unconditional assertion:
assert result is not None
assert result._sagemaker_session == mock_session| with patch("sagemaker.core.processing.transform", return_value={}): | ||
| try: | ||
| result = processor._start_new( | ||
| inputs=[], outputs=[], experiment_config=None |
There was a problem hiding this comment.
Bare except Exception: pass in test hides failures. If _start_new raises an exception, the test silently passes without verifying anything. The test should either expect the exception (use pytest.raises) or not catch it at all.
🤖 Iteration #1 — Review Comments AddressedDescriptionFixes four V2→V3 regression bugs reported in the GitHub issue: Bug 1:
|
sagemaker-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review
This PR fixes four V2→V3 regressions related to session handling, code_location, and CodeArtifact support. While the fixes address real issues, there are several concerns: the custom wait/refresh implementations bypass sagemaker-core's built-in mechanisms in a fragile way, the polling loop in processing.py has no timeout, silently swallowing exceptions in _refresh_training_job is dangerous, and there's a missing import re in processing.py.
| if hasattr(training_job, snake_key): | ||
| try: | ||
| setattr(training_job, snake_key, value) | ||
| except (AttributeError, TypeError, ValueError): |
There was a problem hiding this comment.
Silently swallowing exceptions is dangerous — The bare except (AttributeError, TypeError, ValueError): pass means if a critical attribute (like training_job_status) fails to update, the code silently continues with stale data, potentially causing infinite loops or incorrect status checks. At minimum, log a warning:
except (AttributeError, TypeError, ValueError) as e:
logger.debug(f"Could not set attribute {snake_key} on training job: {e}")Also consider whether you should only update known/expected attributes rather than iterating over the entire describe response.
| if ! echo "$CA_REPOSITORY_ARN" | grep -qE '^arn:aws:codeartifact:[a-z0-9-]+:[0-9]{{12}}:repository/[a-zA-Z0-9-]+/[a-zA-Z0-9-]+$'; then | ||
| echo "WARNING: CA_REPOSITORY_ARN does not match expected format. Skipping CodeArtifact login." | ||
| elif ! hash aws 2>/dev/null; then | ||
| echo "AWS CLI is not installed. Skipping CodeArtifact login." |
There was a problem hiding this comment.
Double-brace escaping is fragile — The comment says "All curly braces are doubled to escape them from Python's str.format()", but CODEARTIFACT_LOGIN is concatenated into INSTALL_REQUIREMENTS which uses .format(). The {{12}} in the grep regex will become {12} after one .format() call, which is correct. However, if this template is ever passed through .format() twice (or used with f-strings), it will break. Consider adding a test that verifies the rendered output contains the literal {12} in the grep pattern, or switch to string.Template / %-formatting to avoid this fragility.
There was a problem hiding this comment.
Scope this PR down to Bug 1 only (wait=True ignores sagemaker session). Bugs 2, 3, and 4 are being addressed by separate PRs (5772 and 5773).
Revert all changes related to:
- Bug 2 (code_location) — any changes to _package_code or code upload paths in processing.py
- Bug 3 (codeartifact_repo_arn) — any changes to FrameworkProcessor.run() signature or runproc script generation
- Bug 4 (ModelTrainer CodeArtifact) — any changes to templates.py or INSTALL_REQUIREMENTS
Keep only the fix for Bug 1: ProcessingJob.refresh() and TrainingJob.refresh() use a global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The fix should ensure that when wait=True is used in FrameworkProcessor.run() and ModelTrainer.train(), the wait/refresh calls use the same boto session that was passed to the processor/trainer.
Do NOT monkey-patch ProcessingJob.wait or TrainingJob.wait globally. Instead, implement the wait at the call site — in processing.py's _start_new method and in trainer_wait.py — by polling describe_processing_job / describe_training_job using the sagemaker_session's client directly, similar to how v2 did it.
The relevant files are sagemaker-core/src/sagemaker/core/processing.py, sagemaker-train/src/sagemaker/train/common_utils/trainer_wait.py, and sagemaker-train/src/sagemaker/train/model_trainer.py. Do not worry about CI failures in this iteration, just focus on this de-scoping! Only address other comments given on this PR if they are related to first bug
🤖 Iteration #2 — Review Comments AddressedDescriptionFixes Bug 1 from the V2→V3 regression report: Bug 1:
|
Description
Four V2→V3 regression bugs: (1) ProcessingJob/TrainingJob.refresh() uses global default client instead of the sagemaker_session passed by the user, causing NoCredentialsError with assumed-role sessions. The wait() in processing.py calls latest_job.wait() which calls refresh() on the sagemaker-core resource, and trainer_wait.py calls training_job.refresh() similarly. (2) FrameworkProcessor accepts code_location but _upload_code in ScriptProcessor always uses default_bucket(). (3) FrameworkProcessor.run() lost the codeartifact_repo_arn parameter from v2 PR #4145. (4) ModelTrainer's INSTALL_REQUIREMENTS template does bare pip install without CA_REPOSITORY_ARN CodeArtifact support.
Related Issue
Related issue: 5765
Changes Made
sagemaker-core/src/sagemaker/core/processing.pysagemaker-train/src/sagemaker/train/templates.pysagemaker-train/src/sagemaker/train/model_trainer.pysagemaker-train/src/sagemaker/train/common_utils/trainer_wait.pysagemaker-core/tests/unit/test_processing_regressions.pyAI-Generated PR
This PR was automatically generated by the PySDK Issue Agent.
Merge Checklist
prefix: descriptionformat